Predicting the cheapest day before a flight to buy tickets - Iteration 2¶

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import math
import glob

from sklearn.metrics import r2_score
In [2]:
import os
os.chdir("../") 
print(os.getcwd())
/Users/bobby/GitHub/Flight-Prices-Predicitons

📦 Data provisioning¶

The flight data was gathered from Google Flights using a web scraper. It includes the routes SOF-EIN, EIN-SOF, EIN-ATH, Istanbul-AMS, and Munich-New York, and covers departures from 03.2025 to 12.2025.

In [3]:
csv_files = glob.glob("./datasets/iteration1/*.csv")

original_main_data = pd.concat([pd.read_csv(file, parse_dates=["departureDate", "record_timestamp"], low_memory=False) for file in csv_files], ignore_index=True)

main_data = original_main_data.copy(deep=True)

📃 Sample the data¶

In [4]:
main_data.sample(10)
Out[4]:
daysAgo departureDate price departure_airport arrival_airport is_public_holiday is_school_holiday airline near_holiday record_timestamp
54653 141 2025-07-12 44 New York Washington, D.C. False True Delta -1.0 2025-02-21
35851 222 2025-10-01 97 İstanbul Amsterdam False False Turkish Airlines NaN 2025-02-21
48074 210 2025-10-04 60 Eindhoven Sofia False False Wizzair -1.0 2025-03-08
16685 132 2025-06-07 73 Sofia Eindhoven False False Wizzair NaN 2025-01-26
16053 82 2025-05-27 38 Sofia Eindhoven False False Wizzair 1.0 2025-03-06
30779 133 2025-07-11 72 İstanbul Amsterdam False True Turkish Airlines 1.0 2025-02-28
37266 43 2025-04-09 65 Eindhoven Sofia False False Wizzair NaN 2025-02-25
5401 120 2025-06-21 96 Eindhoven Athens False False Transavia NaN 2025-02-21
51622 72 2025-05-23 51 New York Washington, D.C. False False Delta -1.0 2025-03-12
38141 37 2025-04-24 70 Eindhoven Sofia False False Other 1.0 2025-03-18

🛠️ Preprocessing¶

The preprocessing step consists of several activities to complete before we can train an algorithm and produce a model that predicts our target variable: the number of days before a flight at which tickets are cheapest.

In [5]:
print("Missing values per column:")
print(main_data.isna().sum())
Missing values per column:
daysAgo                  0
departureDate            0
price                    0
departure_airport        0
arrival_airport          0
is_public_holiday        0
is_school_holiday        0
airline                  0
near_holiday         15941
record_timestamp         0
dtype: int64

We can see that a substantial part of our data, roughly 16k rows, has missing values in near_holiday. This is to be expected.

Adding new feature - distance between departure and arrival airports¶

Using the following code, we can calculate the distance between the two airport's coordinates, which will be a useful feature for our model.

In [6]:
airport_coords = {
    'New York': (40.7128, -74.0060),
    'Amsterdam': (52.3676, 4.9041),
    'Athens': (37.9838, 23.7275),
    'Eindhoven': (51.4416, 5.4697),
    'Sofia': (42.6975, 23.3242),
    'Washington, D.C.': (38.8951, -77.0364),
    'İstanbul': (41.0082, 28.9784)
}

# Function to calculate Haversine distance
# Source: https://stackoverflow.com/questions/25711895/the-result-by-haversine-formula-is-meter-o-kmeter
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Radius of Earth in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)

    a = math.sin(delta_phi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2.0) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    return R * c  # Distance in km

def calculate_distance(row):
    dep = row['departure_airport']
    arr = row['arrival_airport']
    if dep in airport_coords and arr in airport_coords:
        lat1, lon1 = airport_coords[dep]
        lat2, lon2 = airport_coords[arr]
        return haversine_distance(lat1, lon1, lat2, lon2)
    else:
        return None  # Handle missing airport coordinates

main_data['airport_distance_km'] = main_data.apply(calculate_distance, axis=1)
main_data.sample(10)
Out[6]:
daysAgo departureDate price departure_airport arrival_airport is_public_holiday is_school_holiday airline near_holiday record_timestamp airport_distance_km
44366 198 2025-08-05 70 Eindhoven Sofia False True Ryanair -1.0 2025-01-19 1658.335311
33809 185 2025-08-29 138 İstanbul Amsterdam False True Turkish Airlines 1.0 2025-02-25 2211.947562
47881 217 2025-10-01 57 Eindhoven Sofia False False Wizzair NaN 2025-02-26 1658.335311
3834 75 2025-05-26 132 Eindhoven Athens False False Transavia 1.0 2025-03-12 2067.423123
37122 63 2025-04-07 67 Eindhoven Sofia False False Wizzair NaN 2025-02-03 1658.335311
51518 114 2025-05-22 91 New York Washington, D.C. False False Delta -1.0 2025-01-28 328.393017
53735 129 2025-06-27 62 New York Washington, D.C. False False Delta NaN 2025-02-18 328.393017
11771 255 2025-10-05 104 Eindhoven Athens False False Transavia -1.0 2025-01-23 2067.423123
23273 240 2025-09-23 64 Sofia Eindhoven False False Wizzair 1.0 2025-01-26 1658.335311
10524 200 2025-09-14 104 Eindhoven Athens False False Transavia NaN 2025-02-26 2067.423123
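As a quick sanity check on the new column, the same Haversine formula reproduces the Sofia-Eindhoven value shown in the sample above (a standalone copy of the function so the snippet runs on its own):

```python
import math

def haversine_distance(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in km
    R = 6371
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = math.sin(delta_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Sofia -> Eindhoven, should match the airport_distance_km column (~1658 km)
d = haversine_distance(42.6975, 23.3242, 51.4416, 5.4697)
print(round(d, 1))
```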

The code performs feature engineering and encoding to prepare the data for machine learning. It converts departureDate and record_timestamp into Unix timestamps for numerical processing. The near_holiday column is one-hot encoded to avoid its -1, 0, and 1 values being misinterpreted as ordered. It also extracts the weekday from departureDate as a new feature. Finally, categorical variables such as airline, departure_airport, and arrival_airport are converted to numeric labels using label encoding.

In [7]:
# Feature engineering
main_data['departure_date_unix'] = main_data['departureDate'].astype(np.int64) // 10**9
main_data['record_timestamp_unix'] = main_data['record_timestamp'].astype(np.int64) // 10**9
main_data = pd.get_dummies(main_data, columns=['near_holiday'])  # the column holds -1, 0, 1, which the algorithm would otherwise treat as ordered
main_data['departure_weekday'] = main_data['departureDate'].dt.weekday

# Encode categorical variables
le_dep = LabelEncoder()
le_arr = LabelEncoder()
le_airline = LabelEncoder()
main_data['airline'] = le_airline.fit_transform(main_data['airline'])
main_data['departure_airport'] = le_dep.fit_transform(main_data['departure_airport'])
main_data['arrival_airport'] = le_arr.fit_transform(main_data['arrival_airport'])
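One detail worth knowing about the one-hot step: with the default dummy_na=False, pd.get_dummies encodes the ~16k NaN rows in near_holiday as all-zero dummy columns, which effectively acts as a fourth "unknown" category. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"near_holiday": [-1.0, 0.0, 1.0, np.nan]})
dummies = pd.get_dummies(df, columns=["near_holiday"])  # dummy_na=False by default

print(list(dummies.columns))
# the NaN row gets 0 in every dummy column
print(int(dummies.iloc[3].sum()))
```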
In [8]:
def plot_avg_price_per_day(df, x_col='daysAgo', y_col='price'):
    avg_per_day = df.groupby(x_col)[y_col].mean().reset_index().sort_values(by=x_col)

    plt.figure(figsize=(12, 6))
    plt.plot(avg_per_day[x_col], avg_per_day[y_col], marker='o', linestyle='-', color='red')
    plt.title('Average Price per Day Before Departure')
    plt.xlabel('Days Before Departure')
    plt.ylabel('Average Price (€)')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
plot_avg_price_per_day(main_data)
[Figure: Average Price per Day Before Departure]

The chart reveals that flight prices are highest very close to the departure date and tend to drop significantly when booked around 30–90 days in advance. Prices then gradually increase again when booking far in advance, especially beyond 120 days. This pattern suggests that the cheapest tickets are typically available when booking 1–3 months before departure.
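The visual reading above can also be confirmed programmatically: the argmin of the averaged curve gives the day with the lowest mean price. A sketch on toy values (not the real curve):

```python
import pandas as pd

# toy curve of average price (EUR) by days before departure
avg_price = pd.Series({7: 120.0, 30: 90.0, 60: 70.0, 90: 75.0, 150: 95.0, 200: 110.0})

best_window = avg_price.idxmin()  # day with the lowest average price
print(best_window)
```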

Marking the cheapest price for each flight¶

In [9]:
# Step 1: Group and find the cheapest record for each flight
cheapest_rows = main_data.loc[main_data.groupby(
    ['departureDate', 'departure_airport', 'arrival_airport']
)['price'].idxmin()] #Return the row label of the minimum value

# Step 2: Create a mapping from flight to its cheapest daysAgo
cheapest_map = cheapest_rows.set_index(
    ['departureDate', 'departure_airport', 'arrival_airport']
)['daysAgo'].to_dict()

# Step 3: Map it back to the full data
main_data['cheapest_day_future'] = main_data.apply(
    lambda row: cheapest_map.get((row['departureDate'], row['departure_airport'], row['arrival_airport'])),
    axis=1
)

# Drop rows where mapping failed (e.g. missing future prices)
main_data.dropna(subset=['cheapest_day_future'], inplace=True)
main_data['cheapest_day_future'] = main_data['cheapest_day_future'].astype(int)
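The per-row apply in Step 3 works, but it makes one Python call per row; the same mapping can be done with a single merge, which is much faster at ~60k rows. A sketch on toy data:

```python
import pandas as pd

# toy data: two flights, several price observations each
df = pd.DataFrame({
    "departureDate":     ["2025-06-01"] * 3 + ["2025-06-02"] * 2,
    "departure_airport": ["SOF"] * 5,
    "arrival_airport":   ["EIN"] * 5,
    "price":             [80, 60, 90, 70, 65],
    "daysAgo":           [30, 20, 10, 15, 5],
})
keys = ["departureDate", "departure_airport", "arrival_airport"]

# one row per flight: the observation with the minimum price
cheapest = df.loc[df.groupby(keys)["price"].idxmin(), keys + ["daysAgo"]]
cheapest = cheapest.rename(columns={"daysAgo": "cheapest_day_future"})

# the merge replaces the per-row apply; how="left" preserves df's row order
df = df.merge(cheapest, on=keys, how="left")
print(df["cheapest_day_future"].tolist())
```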

💡 Feature Selection¶

Now we will create several graphs to visualize relationships between the features of the dataset

In [10]:
import seaborn as sns
import matplotlib.pyplot as plt

correlations = main_data.corr()

plt.figure(figsize=(min(20, 0.8 * len(correlations)), min(20, 0.8 * len(correlations))))  

sns.heatmap(
    correlations, 
    annot=True, 
    fmt=".2f",
    linewidths=0.5,
    cmap="coolwarm",
)

plt.title("All Feature Correlations Heatmap", fontsize=16)
plt.show()
[Figure: All Feature Correlations Heatmap]
In [11]:
correlation_target = main_data.corr()['cheapest_day_future'].sort_values(ascending=False)
print(correlation_target)
cheapest_day_future      1.000000
departure_date_unix      0.943693
departureDate            0.943693
daysAgo                  0.901768
is_school_holiday        0.245841
near_holiday_1.0         0.217296
price                    0.187732
airport_distance_km      0.142290
near_holiday_-1.0        0.073063
near_holiday_0.0         0.040271
airline                  0.026380
departure_airport        0.011745
record_timestamp        -0.000653
record_timestamp_unix   -0.000653
departure_weekday       -0.004502
is_public_holiday       -0.123838
arrival_airport         -0.142831
Name: cheapest_day_future, dtype: float64

The features most correlated with cheapest_day_future are departure_date_unix (and the equivalent departureDate) and daysAgo, indicating that time-related variables play the biggest role in predicting the cheapest booking day. Other features like airport_distance_km, the near_holiday flags, and airline have low correlation and may contribute little predictive power individually.

Selecting features and target¶

In [12]:
features = [
    'price', 'airport_distance_km',
    'near_holiday_-1.0', 'near_holiday_0.0', 'near_holiday_1.0',
    'departure_airport', 'arrival_airport',
    'daysAgo', 'departure_weekday'
]

target = 'cheapest_day_future'

X = main_data[features]
y = main_data[target]
In [13]:
import seaborn as sns
import matplotlib.pyplot as plt

correlations = main_data[features].corr()

plt.figure(figsize=(min(20, 0.8 * len(correlations)), min(20, 0.8 * len(correlations))))  

sns.heatmap(
    correlations, 
    annot=True, 
    fmt=".2f",
    linewidths=0.5,
    cmap="coolwarm",
)

plt.title("Most Valuable Feature Correlations Heatmap", fontsize=16)
plt.show()
[Figure: Most Valuable Feature Correlations Heatmap]

The heatmap shows that most features have low mutual correlations, indicating they contribute distinct information to the model. airport_distance_km and arrival_airport have the strongest relationships, with arrival_airport showing a strong negative correlation with both distance and price. This suggests that certain arrival airports and longer distances tend to be associated with higher ticket prices.

In [14]:
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder

X_encoded = X.copy()
for col in X_encoded.select_dtypes(include='object').columns:
    X_encoded[col] = LabelEncoder().fit_transform(X_encoded[col])

# Calculate mutual information ("amount of information" obtained about one random variable by observing the other random variable)
mi_scores = mutual_info_regression(X_encoded, y)
mi_series = pd.Series(mi_scores, index=X_encoded.columns).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
mi_series.plot(kind='barh')
plt.title('Mutual Information with Target (cheapest_day_future)')
plt.xlabel('Mutual Information Score')
plt.gca().invert_yaxis()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Mutual Information with Target (cheapest_day_future)]

The feature with the highest mutual information score is price, indicating it provides the most information about the target (cheapest_day_future). Other important features include departure_weekday, daysAgo, and arrival_airport, all showing moderate relevance. Features related to holidays have the lowest scores, suggesting that proximity to holidays has minimal influence on the model’s prediction.

🪓 Splitting into train/test¶

80% of the data is used for training, and 20% for testing

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 59996 observations, of which 47996 are now in the train set, and 12000 in the test set.
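One caveat with a plain random split: multiple rows belong to the same flight (same departureDate and route), so near-duplicate observations can land in both train and test and inflate the test score. If that matters, a group-aware split keeps each flight entirely on one side; a sketch with toy groups:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# toy data: 6 observations belonging to 3 flights (groups)
X = np.arange(12).reshape(6, 2)
y = np.arange(6)
flights = np.array([0, 0, 1, 1, 2, 2])

gss = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=flights))

# every flight ends up entirely in train or entirely in test
print(sorted(set(flights[train_idx])), sorted(set(flights[test_idx])))
```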

🧬 Modelling¶

Previously used algorithm: Linear Regression¶

In [16]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

Adding the most recently learned algorithm, RandomForestRegressor, to compare the two algorithms

In [17]:
rfr = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rfr.fit(X_train, y_train)
rfr_pred = rfr.predict(X_test)

Now let's visualize part of one decision tree to see how the model actually works from the inside

In [18]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

tree = rfr.estimators_[0] 

plt.figure(figsize=(40, 20))
plot_tree(tree, feature_names=X.columns, filled=True, rounded=True, max_depth=3)
plt.title("Random Forest - Tree 0 (first 3 levels)")
plt.show()
[Figure: Random Forest - Tree 0 (first 3 levels)]

This decision tree from the Random Forest model shows that daysAgo is the primary splitting feature, indicating it’s the most influential factor in predicting when a ticket is cheapest. Other important splits involve price, arrival_airport, and airport_distance_km, which refine the prediction based on flight specifics and route characteristics. While near_holiday appears at a deeper node, its limited presence suggests a weaker influence compared to the time-related and location-based features

In [19]:
from supertree import SuperTree

st = SuperTree(
    rfr,                
    X_train.values,     
    y_train,            
    list(X_train.columns), 
    "cheapest_day_future"      
)

# Show the first tree at start
st.show_tree(which_tree=0)

Using the SuperTree library, we can interactively follow the decisions of the RandomForestRegressor: we can dynamically change the depth, zoom in and out, and click on the generated charts, which makes it easier to follow the decision boundaries

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

features = X.columns 

# Get feature importances
importances = pd.Series(rfr.feature_importances_, index=features)

# Get standard deviation across all decision trees
std_dev = np.std([tree.feature_importances_ for tree in rfr.estimators_], axis=0)

# Plot
plt.figure(figsize=(10, 6))
importances.sort_values().plot.barh(xerr=std_dev[np.argsort(importances)], color='teal', alpha=0.8)
plt.title("Feature Importance in Random Forest Regressor")
plt.xlabel("Mean Decrease in Impurity")
plt.tight_layout()
plt.grid(True)
plt.show()
[Figure: Feature Importance in Random Forest Regressor]

The feature importance plot shows that daysAgo is by far the most influential variable, contributing the most to the model’s predictive performance. Other features like price, arrival_airport, and airport_distance_km have minor but non-negligible impact, while the remaining features contribute very little. This suggests the timing before departure is the most critical factor in predicting the cheapest day to buy a flight.
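Worth noting: mean-decrease-in-impurity importances can be biased toward high-cardinality numeric features, so permutation importance on held-out data is a common cross-check. A self-contained sketch on synthetic data (not the flight dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic regression problem: 5 features, only 2 informative
X, y = make_regression(n_samples=400, n_features=5, n_informative=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=0)
model.fit(X_tr, y_tr)

# drop in test score when each feature is shuffled = its permutation importance
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
print(result.importances_mean.round(3))
```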

🔬 Evaluation¶

In order to shed some light on the results, the R² scores of both models can be printed.

In [21]:
# Calculate evaluation metrics
r2_lr = r2_score(y_test, lr_pred)
r2_rfr = r2_score(y_test, rfr_pred)

print(f"R² Score LR: {r2_lr}")
print(f"R² Score RFR: {r2_rfr}")
R² Score LR: 0.8426295914370998
R² Score RFR: 0.9381802268368378

The random forest gives a very good score; next, I'll try boosting to see whether the Linear Regression result can be improved.
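R² is unitless; MAE and RMSE express the error in the target's own units (days), which is easier to interpret here. A sketch with illustrative actual/predicted values (not the real test set):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# illustrative actual vs predicted cheapest_day_future values
y_true = np.array([210, 207, 249, 184])
y_hat  = np.array([205.2, 204.5, 249.0, 177.5])

mae = mean_absolute_error(y_true, y_hat)           # average error in days
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # penalizes large misses more
print(round(mae, 2), round(rmse, 2))
```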

In [22]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

# Scatter plot: Actual vs Predicted
plt.scatter(y_test, lr_pred, alpha=0.5, color='orange', edgecolors='k', label='Linear Regression')
plt.scatter(y_test, rfr_pred, alpha=0.5, color='red', edgecolors='k', label='Random Forest Regressor')

# Add a reference line (perfect predictions)
min_val = min(min(y_test), min(lr_pred), min(rfr_pred))
max_val = max(max(y_test), max(lr_pred), max(rfr_pred))
plt.plot([min_val, max_val], [min_val, max_val], color='gray', linestyle='--', label='Perfect Prediction')

# Labels, title, legend
plt.xlabel('Actual cheapest_day_future')
plt.ylabel('Predicted cheapest_day_future')
plt.title('Actual vs Predicted Cheapest Days to Buy (cheapest_day_future) for LR and RFR')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Actual vs Predicted Cheapest Days to Buy (cheapest_day_future) for LR and RFR]

After removing a data leak from one of the columns, we see that both algorithms show some deviation from the actual values. This is to be expected, as the correlations are not that high. With this slightly lower accuracy, we can now try adding AdaBoost and/or stacking.

In [23]:
import pandas as pd

comparison_df = pd.DataFrame({
    'Actual': y_test.reset_index(drop=True)[:20],
    'Linear Regression': lr_pred[:20].round(2),
    'Random Forest': rfr_pred[:20].round(2),
})

# Add residuals
comparison_df['LR Residual'] = (comparison_df['Actual'] - comparison_df['Linear Regression']).round(2)
comparison_df['RFR Residual'] = (comparison_df['Actual'] - comparison_df['Random Forest']).round(2)

# Display
print(comparison_df)
    Actual  Linear Regression  Random Forest  LR Residual  RFR Residual
0      210             213.82         205.21        -3.82          4.79
1      207             177.40         204.54        29.60          2.46
2      249             212.52         249.00        36.48          0.00
3      184             186.96         177.45        -2.96          6.55
4      202             193.78         200.56         8.22          1.44
5      180             138.09         159.76        41.91         20.24
6      159             137.97         159.38        21.03         -0.38
7      190             201.97         192.23       -11.97         -2.23
8      215             205.15         230.26         9.85        -15.26
9       42              62.24          45.78       -20.24         -3.78
10      85              60.17          42.97        24.83         42.03
11     165             154.19         151.36        10.81         13.64
12     192             199.31         191.03        -7.31          0.97
13     209             207.50         193.18         1.50         15.82
14     187             161.18         186.49        25.82          0.51
15      92              73.18          95.95        18.82         -3.95
16     211             189.99         209.79        21.01          1.21
17     133             131.12         137.82         1.88         -4.82
18      86              71.82          85.98        14.18          0.02
19     207             214.41         185.41        -7.41         21.59

The residuals table reveals that the Random Forest Regressor (RFR) generally produces smaller errors than the Linear Regression (LR) model, especially in cases with larger deviations (e.g., rows 1, 2, 5, 10). The LR model tends to underpredict or overpredict more severely, as seen by higher residuals in several rows. This confirms that the Random Forest model captures non-linear patterns in the data more effectively, leading to improved accuracy.

In [24]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

def compare_lr_vs_rf(X_train, X_test, y_train, y_test, max_depth=None, yLim=None):
    rf_train_scores, rf_test_scores = [], []
    lr_train_scores, lr_test_scores = [], []

    estimators_range = range(10, 211, 20)

    for n in estimators_range:
        rf = RandomForestRegressor(n_estimators=n, max_depth=max_depth, random_state=21, n_jobs=-1)
        rf.fit(X_train, y_train)
        rf_train_scores.append(rf.score(X_train, y_train))
        rf_test_scores.append(rf.score(X_test, y_test))

        lr = LinearRegression()
        lr.fit(X_train, y_train)
        lr_train_scores.append(lr.score(X_train, y_train))
        lr_test_scores.append(lr.score(X_test, y_test))

    plt.figure(figsize=(12, 6))

    # Plot curves
    plt.plot(estimators_range, rf_train_scores, marker='o', linestyle='--', label=f'RF Train (max_depth={max_depth})', linewidth=2)
    plt.plot(estimators_range, rf_test_scores, marker='o', label=f'RF Test (max_depth={max_depth})', linewidth=2)
    plt.plot(estimators_range, lr_train_scores, marker='s', linestyle='--', label='LR Train (constant)', linewidth=2)
    plt.plot(estimators_range, lr_test_scores, marker='s', label='LR Test (constant)', linewidth=2)

    plt.xlabel('Number of Estimators (for RF only)')
    plt.ylabel('R² Score')
    plt.title('Train vs Test: Random Forest vs Linear Regression')
    plt.grid(True)
    plt.xlim(estimators_range[0], estimators_range[-1])

    # === Optional Y-axis lower limit ===
    if yLim is not None:
        plt.ylim(bottom=yLim)

    plt.legend()
    plt.tight_layout()
    plt.show()
In [25]:
compare_lr_vs_rf(X_train, X_test, y_train, y_test, max_depth=10)
[Figure: Train vs Test: Random Forest vs Linear Regression]

Random Forest (max_depth=10):

  • Train R² ≈ 0.990–0.992, very high — suggesting near-perfect fit on training data.
  • Test R² ≈ 0.989–0.990, almost identical to train — indicating no overfitting and strong generalization.
  • Increasing the number of estimators improves consistency but brings minimal gain after ~50 estimators.

Linear Regression:

  • Flat performance across all points (as expected, since it’s not affected by n_estimators).
  • Train & Test R² ≈ 0.918, consistently lower than Random Forest, meaning it underfits slightly and misses non-linear patterns in the data.

After discussions with teachers, they suggested that the chart's y-axis should start at 0 so as not to give a false impression: both algorithms differ by less than 10% in accuracy, but the zoomed-in chart exaggerates the gap.

In [26]:
compare_lr_vs_rf(X_train, X_test, y_train, y_test, max_depth=10, yLim=0)
[Figure: Train vs Test: Random Forest vs Linear Regression (y-axis starting at 0)]

Boosting¶

Lastly, we can try to improve our results further using the knowledge from the Optimization Lecture - Boosting / Stacking

In [27]:
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

def compare_rf_vs_adaboost(X_train, X_test, y_train, y_test, max_depth, learning_rate=0.5, yLim = None):
    rf_train_scores, rf_test_scores = [], []
    ada_train_scores, ada_test_scores = [], []

    estimators_range = range(10, 211, 20)

    for n in estimators_range:
        # === Random Forest ===
        rf = RandomForestRegressor(n_estimators=n, max_depth=max_depth, random_state=21, n_jobs=-1)
        rf.fit(X_train, y_train)
        rf_train_scores.append(rf.score(X_train, y_train))
        rf_test_scores.append(rf.score(X_test, y_test))

        # === AdaBoost  ===
        ada = AdaBoostRegressor(
            estimator=DecisionTreeRegressor(max_depth=max_depth),
            n_estimators=n,
            learning_rate=learning_rate,
            random_state=21
        )
        ada.fit(X_train, y_train)
        ada_train_scores.append(ada.score(X_train, y_train))
        ada_test_scores.append(ada.score(X_test, y_test))

    # === Plot Results ===
    plt.figure(figsize=(12, 6))
    
        # === Optional Y-axis lower limit ===
    if yLim is not None:
        plt.ylim(bottom=yLim)

    # Random Forest
    plt.plot(estimators_range, rf_train_scores, marker='o', linestyle='--', label=f'RF Train (max_depth={max_depth})', linewidth=2)
    plt.plot(estimators_range, rf_test_scores, marker='o', label=f'RF Test (max_depth={max_depth})', linewidth=2)

    # AdaBoost
    plt.plot(estimators_range, ada_train_scores, marker='s', linestyle='--', label=f'AdaBoost Train (max_depth={max_depth})', linewidth=2)
    plt.plot(estimators_range, ada_test_scores, marker='s', label=f'AdaBoost Test (max_depth={max_depth})', linewidth=2)

    plt.xlabel('Number of Estimators')
    plt.ylabel('R² Score')
    plt.title('Train vs Test: Random Forest vs AdaBoost')
    plt.grid(True)
    plt.xlim(estimators_range[0], estimators_range[-1])
    plt.legend()
    plt.tight_layout()
    plt.show()
In [28]:
compare_rf_vs_adaboost(X_train, X_test, y_train, y_test, max_depth=10, yLim=0)
[Figure: Train vs Test: Random Forest vs AdaBoost (y-axis starting at 0)]

When comparing train vs test for RF and AdaBoost with the same arguments, we can see that both perform really well; however, let's zoom in to see if there is any meaningful difference.

In [29]:
compare_rf_vs_adaboost(X_train, X_test, y_train, y_test, max_depth=10)
[Figure: Train vs Test: Random Forest vs AdaBoost (zoomed in)]

With a max depth of 10, both Random Forest and AdaBoost show similar test R² performance. However, AdaBoost slightly outperforms Random Forest in test accuracy up to ~100 estimators before declining, indicating potential overfitting. Random Forest maintains more stable performance across the full estimator range. Still, I'll use the Random Forest, as I want robust and reliable performance with less risk of overfitting.

In [30]:
from sklearn.ensemble import StackingRegressor
# Try new base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=100, max_depth=10, random_state=21)),
    ('lr', LinearRegression()),
]

# Meta-model
meta_model = LinearRegression()

# Stacking
stack_model = StackingRegressor(
    estimators=base_models,
    final_estimator=meta_model,
    n_jobs=-1
)

stack_model.fit(X_train, y_train)

from sklearn.metrics import r2_score

y_pred_stack = stack_model.predict(X_test)
r2_stack = r2_score(y_test, y_pred_stack)

print("New Stacking Model R²:", r2_stack)
New Stacking Model R²: 0.9389904109448793

Before stacking, R² was around 0.9382. With an increase of only ~0.0008, I don't think stacking is worth it; I'll just use the nicely fine-tuned base RFR.

Inference¶

In [31]:
# Example user input
from datetime import datetime

departure_date = datetime(2025, 8, 31)
record_date = datetime(2025, 4, 25)
daysAgo_input = (departure_date - record_date).days

sample_input = pd.DataFrame([{
    'price': 208,
    'airport_distance_km': 2000,
    'near_holiday_-1.0': 0,
    'near_holiday_0.0': 0,
    'near_holiday_1.0': 0,
    'departure_airport': le_dep.transform(['Sofia'])[0],
    'arrival_airport': le_arr.transform(['Eindhoven'])[0],
    'daysAgo': daysAgo_input,
    'departure_weekday': departure_date.weekday()
}])

user_input = sample_input[features]
user_pred = rfr.predict(user_input)[0]
top3_user_preds = np.round([user_pred - 1, user_pred, user_pred + 1]).astype(int)

print("Top 3 estimated best days before departure to buy:")
print(top3_user_preds)
Top 3 estimated best days before departure to buy:
[140 141 142]

The algorithm seems to perform well enough to call the project a success!

In [40]:
import joblib

# Save your classifier
joblib.dump(rfr, './web-app/FlightPredictionsWebApp/models/flight_model.pkl')

# Save encoders if needed
joblib.dump(le_dep, './web-app/FlightPredictionsWebApp/models/departure_encoder.pkl')
joblib.dump(le_arr, './web-app/FlightPredictionsWebApp/models/arrival_encoder.pkl')
Out[40]:
['./web-app/FlightPredictionsWebApp/models/arrival_encoder.pkl']


✈️ Summary – Iteration 2¶

What went well:

  • Switched to regression instead of classification, directly predicting the cheapest daysAgo value.
  • Introduced more advanced models: StackingRegressor (Random Forest + Linear Regression), with improved R² scores.
  • Conducted hyperparameter tuning and model comparison (RandomForest, KNN, LR) showing consistent evaluation.

What didn’t go well:

  • Some models still showed signs of overfitting (high training vs lower test R²).
  • Data imbalance issues and noise in certain ranges of days may have affected accuracy.

What changed:

  • Shifted from categorical buckets (daysAgo_category) to predicting actual daysAgo values.
  • Added features like departureDay, departureMonth, daysAgo, holiday flags, and timestamp-based data.
  • Switched evaluation focus from classification metrics to regression metrics like R² and RMSE.

In [38]:
from datetime import datetime, timedelta

# Function to generate features for your model
def features_generator(flight_date, purchase_date, current_price, dep_airport_encoded, arr_airport_encoded, airport_distance_km, near_holiday_flags):
    daysAgo = (flight_date - purchase_date).days
    departure_weekday = flight_date.weekday()

    # Correct feature vector: with today's known price
    features = [
        current_price,
        airport_distance_km,
        near_holiday_flags.get(-1.0, 0),
        near_holiday_flags.get(0.0, 0),
        near_holiday_flags.get(1.0, 0),
        dep_airport_encoded,
        arr_airport_encoded,
        daysAgo,
        departure_weekday
    ]
    return features

# Function to find the best future purchase day
def find_best_future_daysAgo(model, flight_date, today_date, current_price, dep_airport_encoded, arr_airport_encoded, airport_distance_km, near_holiday_flags):
    """
    Find the best daysAgo to buy a ticket, considering only today -> flight date.

    Parameters:
    - model: trained machine learning model
    - flight_date: datetime object
    - today_date: datetime object
    - current_price: float (known price today)
    - dep_airport_encoded: encoded departure airport
    - arr_airport_encoded: encoded arrival airport
    - airport_distance_km: distance between airports in km
    - near_holiday_flags: dict like {-1.0: 0, 0.0: 0, 1.0: 0}

    Returns:
    - best_daysAgo (int): best number of days before departure
    - best_prediction (float): best predicted daysAgo value
    """
    best_daysAgo = None
    best_prediction = float('inf')

    days_until_flight = (flight_date - today_date).days

    for daysAgo_candidate in range(days_until_flight, -1, -1):  # from today down to flight day
        candidate_purchase_date = flight_date - timedelta(days=daysAgo_candidate)

        if candidate_purchase_date < today_date:
            continue  # skip past dates

        features = features_generator(
            flight_date,
            candidate_purchase_date,
            current_price,
            dep_airport_encoded,
            arr_airport_encoded,
            airport_distance_km,
            near_holiday_flags
        )

        prediction = model.predict([features])[0]

        if prediction < best_prediction:
            best_prediction = prediction
            best_daysAgo = daysAgo_candidate

    return best_daysAgo, best_prediction

# Example usage:
flight_date = datetime(2025, 7, 5)
today_date = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0)

current_price = 123  # Known today
dep_airport_encoded = le_dep.transform(['Eindhoven'])[0]
arr_airport_encoded = le_arr.transform(['Sofia'])[0]
airport_distance_km = 2200
near_holiday_flags = {-1.0: 0, 0.0: 0, 1.0: 0}

best_daysAgo, best_prediction = find_best_future_daysAgo(
    rfr,  # your model
    flight_date,
    today_date,
    current_price,
    dep_airport_encoded,
    arr_airport_encoded,
    airport_distance_km,
    near_holiday_flags
)

print(f"✅ Best future day to buy: {best_daysAgo} days before departure")
print(f"✅ Predicted best daysAgo value: {best_prediction:.2f}")
✅ Best future day to buy: 40 days before departure
✅ Predicted best daysAgo value: 48.05
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(
/Users/bobby/GitHub/Flight-Prices-Predicitons/myenv/lib/python3.13/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
  warnings.warn(